Example 1: Applying a Data Pipeline to Multiple DataFrames¶

  • This example demonstrates how to use the DataPipeline class from the src.data.dataset module to apply a sequence of transformations to multiple pandas DataFrames.
  • The pipeline is initialized using a configuration file (pipeline.yaml), and the transformations are applied to a list of sample DataFrames representing structured data with fields such as radius, volume, and other.
  • The output displays the original data and the transformed results.
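The contents of `settings/pipeline.yaml` are not shown in this notebook; the sketch below is a hypothetical reconstruction inferred from the step names and parameters that appear in the log output (`choose_columns`, `delete_outlier_in_volume`, `concat_dataframes` with join type `inner`). The real schema and key names may differ.

```yaml
# Hypothetical settings/pipeline.yaml — inferred from the logs, not the real file.
steps:
  - choose_columns:
      columns: [radius, volume]
  - delete_outlier_in_volume:
      threshold: 100        # assumption: rows with volume above this are dropped
  - concat_dataframes:
      join: inner
```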
In [1]:
import pandas as pd
from src.data import dataset

# Sample data
data1 = pd.DataFrame({
    "radius": [1, 2, 3, 4],
    "volume": [50, 150, 30, 200],
    "other": [9, 8, 7, 6]
})

data2 = pd.DataFrame({
    "radius": [5, 6, 7, 8],
    "volume": [80, 90, 120, 40],
    "other": [1, 2, 3, 4]
})

data = [data1, data2]
print(data)

# Initialize DataPipeline and apply the data transformations
pipeline = dataset.DataPipeline(config_path="settings/pipeline.yaml")
transformed_data = pipeline.apply(data)

print("Transformed Data:")
print(transformed_data)
2025-01-03 10:10:57,136 - INFO - dataset.py - Loaded the YAML configuration file <_io.TextIOWrapper name='settings/pipeline.yaml' mode='r' encoding='utf-8'>.
2025-01-03 10:10:57,137 - INFO - dataset.py - Building data transformation pipeline.
2025-01-03 10:10:57,137 - INFO - dataset.py - Added step 'choose_columns' to the pipeline.
2025-01-03 10:10:57,138 - INFO - dataset.py - Added step 'delete_outlier_in_volume' to the pipeline.
2025-01-03 10:10:57,139 - INFO - dataset.py - Added step 'concat_dataframes' to the pipeline.
2025-01-03 10:10:57,139 - INFO - dataset.py - Data transformation pipeline built successfully.
2025-01-03 10:10:57,139 - INFO - dataset.py - Applying the data transformation pipeline.
2025-01-03 10:10:57,141 - INFO - dataset.py - Selected columns from 2 DataFrames.
2025-01-03 10:10:57,141 - INFO - dataset.py - Filtered 2 DataFrames.
2025-01-03 10:10:57,142 - INFO - dataset.py - Concatenating DataFrames with join type 'inner'.
2025-01-03 10:10:57,143 - INFO - dataset.py - Concatenated 2 DataFrames into one DataFrame with shape (5, 2).
2025-01-03 10:10:57,143 - INFO - dataset.py - Data transformation pipeline applied successfully.
[   radius  volume  other
0       1      50      9
1       2     150      8
2       3      30      7
3       4     200      6,    radius  volume  other
0       5      80      1
1       6      90      2
2       7     120      3
3       8      40      4]
Transformed Data:
   radius  volume
0       1      50
1       3      30
2       5      80
3       6      90
4       8      40
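The three logged steps can be reproduced with plain pandas as a sketch. The column selection and the inner-join concatenation are read directly from the logs; the outlier rule (`volume <= 100`) is an assumption inferred from which rows survive in the printed result, not from `pipeline.yaml`.

```python
import pandas as pd

# Plain-pandas sketch of the three logged pipeline steps.
def choose_columns(df, columns=("radius", "volume")):
    # Keep only the configured columns.
    return df[list(columns)]

def delete_outlier_in_volume(df, threshold=100):
    # Assumed rule: drop rows whose volume exceeds the threshold.
    return df[df["volume"] <= threshold]

def concat_dataframes(dfs, join="inner"):
    # Stack the filtered frames into a single DataFrame.
    return pd.concat(dfs, join=join, ignore_index=True)

data1 = pd.DataFrame({"radius": [1, 2, 3, 4],
                      "volume": [50, 150, 30, 200],
                      "other": [9, 8, 7, 6]})
data2 = pd.DataFrame({"radius": [5, 6, 7, 8],
                      "volume": [80, 90, 120, 40],
                      "other": [1, 2, 3, 4]})

frames = [delete_outlier_in_volume(choose_columns(df)) for df in (data1, data2)]
result = concat_dataframes(frames)
print(result)  # 5 rows x 2 columns, matching the pipeline output above
```

With the assumed threshold of 100, rows with volumes 150, 200, and 120 are dropped, leaving the five rows shown in the transformed output.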

Example 2: Image Processing Pipeline with Visualization¶

  • This example demonstrates how to use the ImagePipeline class from the src.data.dataset module to apply image processing transformations to a set of sample images.
  • The pipeline is configured via the image_pipeline.yaml file and processes a list of mock images (e.g., a white image and a black image).
  • Additionally, the ImageVisualizer class from src.utils.plotters is used to visualize the images before and after processing.
  • The output includes the shapes of the processed images and their visual representations.
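As with Example 1, `settings/image_pipeline.yaml` is not shown here; the sketch below is a hypothetical reconstruction inferred from the logged steps (`resize` to 100x100, `normalize`, `convert_color` with code 6, which matches OpenCV's `cv2.COLOR_BGR2GRAY`). The real schema may differ.

```yaml
# Hypothetical settings/image_pipeline.yaml — inferred from the logs, not the real file.
steps:
  - resize:
      width: 100
      height: 100
  - normalize: {}
  - convert_color:
      code: 6               # cv2.COLOR_BGR2GRAY
```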
In [2]:
import numpy as np
from src.data import dataset
from src.utils import plotters

# Sample images (mock data for illustration)
image1 = np.ones((100, 100, 3), dtype=np.uint8) * 255  # White image
image2 = np.zeros((100, 100, 3), dtype=np.uint8)      # Black image

images = [image1, image2]

visualizer = plotters.ImageVisualizer(width=400, height=400)
visualizer.display_images(images)

# Initialize ImagePipeline and apply image processing
pipeline = dataset.ImagePipeline(config_path="settings/image_pipeline.yaml")
processed_images = [pipeline.apply(img) for img in images]

print("Processed Images:")
for idx, img in enumerate(processed_images):
    print(f"Image {idx+1}: Shape={img.shape}")

visualizer.display_images(processed_images)
Displaying Image 1: Shape=(100, 100, 3)
Displaying Image 2: Shape=(100, 100, 3)
2025-01-03 10:10:57,910 - INFO - dataset.py - Building image processing pipeline.
2025-01-03 10:10:57,911 - INFO - dataset.py - Added step 'resize' to the pipeline.
2025-01-03 10:10:57,912 - INFO - dataset.py - Added step 'normalize' to the pipeline.
2025-01-03 10:10:57,913 - INFO - dataset.py - Added step 'convert_color' to the pipeline.
2025-01-03 10:10:57,913 - INFO - dataset.py - Image processing pipeline built successfully.
2025-01-03 10:10:57,914 - INFO - dataset.py - Applying the image processing pipeline.
2025-01-03 10:10:57,915 - INFO - dataset.py - Resized image to 100x100.
2025-01-03 10:10:57,916 - INFO - dataset.py - Normalized image.
2025-01-03 10:10:57,916 - INFO - dataset.py - Converted image color using code 6.
2025-01-03 10:10:57,917 - INFO - dataset.py - Image processing pipeline applied successfully.
2025-01-03 10:10:57,917 - INFO - dataset.py - Applying the image processing pipeline.
2025-01-03 10:10:57,918 - INFO - dataset.py - Resized image to 100x100.
2025-01-03 10:10:57,919 - INFO - dataset.py - Normalized image.
2025-01-03 10:10:57,919 - INFO - dataset.py - Converted image color using code 6.
2025-01-03 10:10:57,920 - INFO - dataset.py - Image processing pipeline applied successfully.
Processed Images:
Image 1: Shape=(100, 100)
Image 2: Shape=(100, 100)
Displaying Image 1: Shape=(100, 100)
Displaying Image 2: Shape=(100, 100)
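The logged image steps can be sketched with NumPy alone. Color-conversion code 6 happens to be `cv2.COLOR_BGR2GRAY`, so the conversion is assumed to be BGR-to-grayscale with the standard ITU-R BT.601 weights; the exact behavior of the real pipeline steps is not shown in this notebook.

```python
import numpy as np

# NumPy-only sketch of the three logged steps: resize, normalize, convert_color.
def resize(img, size=(100, 100)):
    # The sample images are already 100x100, so this is a no-op check here.
    assert img.shape[:2] == size
    return img

def normalize(img):
    # Scale uint8 pixel values to [0.0, 1.0].
    return img.astype(np.float32) / 255.0

def convert_color(img):
    # Assumed BGR -> grayscale conversion (ITU-R BT.601 weights);
    # this collapses (H, W, 3) to (H, W), matching the printed shapes.
    b, g, r = img[..., 0], img[..., 1], img[..., 2]
    return 0.114 * b + 0.587 * g + 0.299 * r

white = np.ones((100, 100, 3), dtype=np.uint8) * 255
black = np.zeros((100, 100, 3), dtype=np.uint8)
for img in (white, black):
    out = convert_color(normalize(resize(img)))
    print(out.shape)  # (100, 100), matching the pipeline output above
```

The shape change from `(100, 100, 3)` to `(100, 100)` in the printed results is what the grayscale conversion produces: the channel axis is collapsed into a single intensity value.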

Save as HTML file¶

In [6]:
import plotly
plotly.offline.init_notebook_mode()
In [7]:
%%time
## Save the notebook as an HTML file
from src.utils import utils

source_notebook = "./notebooks/01_process_pipeline_dataframe.ipynb"
target_folder = "./docs/reports"

converter = utils.get_notebook_converter(
    source_notebook, target_folder, additional_pdf=False
)
converter.convert()
last save time: 2025-01-03 11:57:42.451276
Converting Notebook file: ./notebooks/01_process_pipeline_dataframe.ipynb
HTML with embedded Plotly figures saved to docs\reports\01_process_pipeline_dataframe.html
CPU times: total: 297 ms
Wall time: 350 ms
d:\Foester\initial preparation\pipeline\.venv\share\jupyter\nbconvert\templates\base\display_priority.j2:32: UserWarning:

Your element with mimetype(s) dict_keys(['application/vnd.plotly.v1+json']) is not able to be represented.